Detecting Family Resemblance: Automated Genre Classification.
This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.
Automating Metadata Extraction: Genre Classification
A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at the documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
Examining Variations of Prominent Features in Genre Classification.
This paper investigates the correlation between features of three types (visual, stylistic and topical) and genre classes. The majority of previous studies in automated genre classification have created models based on an amalgamated representation of a document using a combination of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. In this paper we use classifiers independently modeled on three groups of features to examine six genre classes and show that the strongest features for making one classification are not necessarily the best features for carrying out another classification.
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.
Formulating representative features with respect to document genre classification
Genre classification (e.g. whether a document is a scientific article or magazine article) is closely bound to the physical and conceptual structure of a document as well as the level of depth involved in the text. Hence, it provides a means of ranking documents retrieved by search tools according to metrics other than topical similarity. Moreover, the structural information derived from genre classification can be used to locate target information within the text. In previous studies, the detection of genre classes has been attempted by using some normalised frequency of terms or combinations of terms in the document (here, we use "term" to refer to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from deeper linguistic or semantic analysis). These approaches largely neglect how the term is distributed throughout the document. Here, we report the results of automated experiments based on distributive statistics of words in order to present evidence that term distribution pattern is a better indicator of genre class than term frequency.
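The contrast between term frequency and term distribution can be illustrated with a minimal Python sketch. The abstract does not specify which distributive statistics the paper uses, so the choices here (mean and standard deviation of length-normalised positions) and the helper name `term_profile` are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def term_profile(tokens, term):
    """Return (relative frequency, mean position, position spread) for a term.

    Positions are normalised to [0, 1] by document length so that documents
    of different lengths can be compared. This is an illustrative choice of
    statistics, not the measure used in the paper.
    """
    n = len(tokens)
    hits = [i for i, tok in enumerate(tokens) if tok == term]
    if not hits:
        return 0.0, None, None
    positions = [i / max(n - 1, 1) for i in hits]
    return len(hits) / n, mean(positions), pstdev(positions)


# Two toy documents in which "method" occurs equally often,
# but clustered at the start in one and scattered in the other.
clustered = ["method"] * 3 + ["x"] * 9
scattered = ["method", "x", "x", "x", "x", "method",
             "x", "x", "x", "x", "x", "method"]

for name, doc in (("clustered", clustered), ("scattered", scattered)):
    freq, centre, spread = term_profile(doc, "method")
    print(name, freq, centre, spread)
```

Both toy documents give the same relative frequency for "method", so a frequency-only feature cannot separate them, whereas the position statistics differ.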
Building a document genre corpus: a profile of the KRYS I corpus
This paper describes the KRYS I corpus, consisting of documents classified into 70 genre classes. It has been constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously there has been very little work on building corpora of texts which have been classified using a non-topical genre palette. This is partly because genre as a concept is rooted in philosophy, rhetoric and literature, and is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised and there is no genre classification schema that has been consolidated to have applicable value in this direction. By presenting here our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, as well as a way forward for creating a better corpus for testing automated genre classification tasks and the application of these tasks to other domains.
Variations of word frequencies in Genre classification tasks.
This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In this report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of their significance in carrying out classification in varying environments.
Implicit References to Citations: A study of astronomy papers.
This paper presents results in the automatic classification of pronouns within articles into those which refer to cited research and those which do not. It also discusses the automatic linking of pronouns which do refer to citations to their corresponding citations. The current study focussed on the pronoun "they" as used in papers in Astronomy journals. The paper describes a classifier trained on maximum entropy principles using features defined by the distance to preceding citations and the category of verbs associated with the pronoun under consideration.
Documentary genre and digital recordkeeping: red herring or a way forward?
The purpose of this paper is to provide a preliminary assessment of the utility of the genre concept for digital recordkeeping. The exponential growth in the volume of records created since the 1940s has been a key motivator for the development of strategies that do not involve the review or processing of individual documents or files. Automation now allows processes at a level of granularity that is rarely, if at all, possible in the case of manual processes, without loss of cognisance of context. For this reason, it is timely to revisit concepts that may have been disregarded because of a perceived limited effectiveness in contributing anything to theory or practice. In this paper, the genre concept and its employability in the management of current and archival digital records are considered, as a form of social contextualisation of a document and as an attractive entry point of granularity at which to implement automation of appraisal processes. Particular attention is paid to the structurational view of genre and its connections with recordkeeping theory.